As previously discussed, the transport of X protocol between the client and server requires considerable CPU resources. A shared memory transport provides a faster means of transferring X11 protocol between local X clients and the X server by avoiding copies of X protocol buffers. Many workstation vendors provide some sort of shared memory transport and typically report X benchmark results obtained using it.
Details of shared memory transport implementations are not widely published, both because these transports often rely on operating-system-specific features and because vendors treat them as proprietary performance advantages.
In practice, the implementation of a shared memory transport is more involved than one might imagine. A simple reader/writer shared memory queue with semaphores is unacceptable because the X server must not be allowed to hang indefinitely on a semaphore or spinlock used to arbitrate access to the queue. The X server must fairly service requests and timeouts from multiple input sources. It must also robustly handle asynchronous client death and corruption of the shared memory queue.
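To make the hazard concrete, here is a deliberately naive sketch of such a reader/writer queue built on POSIX semaphores; the ShmQueue layout and names are hypothetical rather than taken from any actual transport, and the comment marks the unbounded wait an X server cannot tolerate.

    #include <semaphore.h>
    #include <stddef.h>
    #include <stdint.h>

    #define SHMQ_SIZE (256 * 1024)   /* roughly the X protocol's maximum request size */

    typedef struct {                 /* hypothetical layout placed in a shared segment */
        sem_t    bytes_ready;        /* posted by the client after writing requests */
        sem_t    space_free;         /* posted by the server after consuming requests */
        size_t   head, tail;         /* ring-buffer indices touched by both sides */
        uint8_t  data[SHMQ_SIZE];
    } ShmQueue;

    /* Server side of the naive scheme: block until the client posts more
     * request bytes.  This is exactly what an X server cannot afford to do;
     * if the client dies, or scribbles on the queue state, the server sits
     * here indefinitely instead of servicing its other clients, devices,
     * and timeouts. */
    static void shmq_server_wait(ShmQueue *q)
    {
        sem_wait(&q->bytes_ready);   /* unbounded wait on client-controlled state */
    }

A workable implementation must instead bound its waits, validate the queue state on every access, and recover cleanly when a client disappears.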
Other logistical problems exist. The MIT sample X server [1] reasonably assumes that each request is dispatched from a complete and contiguous request buffer. The X server could construct such a request by coalescing partial buffers from the shared memory transport into a contiguous buffer, but the copying this requires obviates much of the benefit of the shared memory transport precisely for the large requests where it helps most. For this reason, a shared memory transport implementation is likely to allocate a shared memory queue as large as the largest X request (most servers support the X protocol limit of roughly a quarter megabyte). To obtain better performance, some vendors allocate even larger shared memory queues. The result is a transport that performs well for benchmarks but has poor locality of reference within the shared memory queue. In a benchmark situation where a single shared memory connection is in use, this is not a problem; but when every local X client on the system is cycling through its own large shared memory queue, the result is poor memory system behavior that hinders overall system performance.
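The contiguity problem can be illustrated with a hedged sketch. The ReqHeader below mirrors the generic X request header (a one-byte opcode, one request-specific byte, and a 16-bit length in 4-byte units, which is where the quarter-megabyte limit of 65535 × 4 bytes comes from); the ring-buffer arrangement and the helper itself are hypothetical.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct {          /* mirrors the generic X request header */
        uint8_t  reqType;     /* major opcode */
        uint8_t  data;        /* request-specific byte */
        uint16_t length;      /* total request length in 4-byte units */
    } ReqHeader;

    /* Hypothetical helper: present the request starting at 'start' in a
     * shared ring of 'ring_size' bytes as one contiguous buffer.  Assumes
     * requests are 4-byte aligned and the header itself never wraps (a
     * writer can pad to guarantee this).  If the request body wraps, it
     * must be coalesced into 'scratch' -- the very copy the shared memory
     * transport was meant to avoid, and costliest for the largest requests. */
    static const uint8_t *contiguous_request(const uint8_t *ring, size_t ring_size,
                                             size_t start, uint8_t *scratch)
    {
        const ReqHeader *hdr = (const ReqHeader *) (ring + start);
        size_t reqlen = (size_t) hdr->length * 4;   /* at most 65535 * 4 bytes */

        if (start + reqlen <= ring_size)
            return ring + start;                    /* already contiguous: zero copy */

        size_t first = ring_size - start;           /* bytes before the wrap point */
        memcpy(scratch, ring + start, first);
        memcpy(scratch + first, ring, reqlen - first);
        return scratch;
    }

Sizing the queue to the largest possible request makes the wrap case rare, which is exactly the trade-off that leads to the large, poorly localized queues described above.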
Excepting a few important requests like GetImage and GetProperty, the server-to-client direction is not as critical to X performance as the client-to-server direction. This makes it likely that a shared memory transport will use shared memory only for the client-to-server direction. Xlib's need for a select-able file descriptor and the need to detect transport failure make a standard TCP or Unix domain transport the likely choice for the server-to-client direction. Such an implementation decision restricts the performance benefit of a shared memory transport to the (admittedly more important) client-to-server direction.
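A purely illustrative client-side flush for such a hybrid arrangement is sketched below: the request bytes go into the shared queue with an ordinary user-level copy (the kernel never copies the buffer), while a one-byte kick is written to the Unix domain socket so the server's select() loop wakes up and so either side notices if the other dies. The function and its parameters are hypothetical; flow control and queue wraparound are omitted.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical hybrid flush: requests travel through shared memory,
     * but the ordinary socket is retained for wakeup, replies, events,
     * and failure detection. */
    static int hybrid_flush(uint8_t *shm_queue, size_t *write_off,
                            const uint8_t *reqs, size_t nbytes, int sock_fd)
    {
        memcpy(shm_queue + *write_off, reqs, nbytes);   /* user-level copy only */
        *write_off += nbytes;                           /* wraparound omitted */

        const uint8_t kick = 1;
        return write(sock_fd, &kick, 1) == 1 ? 0 : -1;  /* wake the server's select() */
    }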
Aside from these implementation issues, the performance potential of a shared memory transport is limited. While it reduces transport costs, it does not reduce the two other system costs of window system interaction: context switching and protocol packing and unpacking. A shared memory transport may also introduce new memory system strains.
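For example, even with a zero-copy transport, Xlib must still marshal each request into wire format and the server must still unpack it. The simplified sketch below packs a core PolyPoint request by hand; the helper is hypothetical, but the layout (opcode 64, twelve header bytes, a length counted in 4-byte units, four bytes per point) follows the core protocol encoding. None of this work changes when the transport changes.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct { int16_t x, y; } WirePoint;

    /* Hypothetical marshalling helper for a core PolyPoint request.
     * Byte order is the client's, as negotiated at connection setup. */
    static size_t pack_poly_point(uint8_t *out, uint32_t drawable, uint32_t gc,
                                  const WirePoint *pts, uint16_t npts)
    {
        uint16_t len_units = 3 + npts;          /* 12-byte header + points, in 4-byte units */

        out[0] = 64;                            /* X_PolyPoint opcode */
        out[1] = 0;                             /* coordinate mode: Origin */
        memcpy(out + 2,  &len_units, 2);
        memcpy(out + 4,  &drawable,  4);
        memcpy(out + 8,  &gc,        4);
        memcpy(out + 12, pts, (size_t) npts * 4);

        return (size_t) len_units * 4;          /* bytes the transport must carry */
    }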
Table 1 indicates a number of important performance cases that a shared memory transport does not address. Notice that operations requiring a reply have high operating system overhead (80% for GetProperty; over 90% for large GetImages). Such cases demonstrate the cost of context switching and of sending data from server to client: each such Xlib call waits for a reply, forcing a round trip between the client and server. Also notice that in no case is the kernel bcopy overhead larger than 25%. This places a fairly low upper bound on the performance potential of a shared memory transport, because eliminating kernel data copies is the fundamental saving a shared memory transport offers.
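The round-trip effect is easy to reproduce at the Xlib level; the fragment below is illustrative rather than taken from the benchmark. Drawing calls are merely appended to Xlib's output buffer, while any call that returns server data flushes the buffer and then blocks for the reply, paying the context switches no matter how the bytes are carried.

    #include <X11/Xlib.h>
    #include <X11/Xatom.h>

    int main(void)
    {
        Display *dpy = XOpenDisplay(NULL);
        if (!dpy)
            return 1;

        Window root = DefaultRootWindow(dpy);
        GC gc = DefaultGC(dpy, DefaultScreen(dpy));

        /* Buffered: no transport activity or context switch yet. */
        XDrawPoint(dpy, root, gc, 10, 10);

        /* Reply-bearing: Xlib flushes its buffer and blocks for the reply,
         * forcing a full client/server round trip. */
        Atom type;
        int fmt;
        unsigned long nitems, bytes_after;
        unsigned char *data = NULL;
        XGetWindowProperty(dpy, root, XA_RESOURCE_MANAGER, 0, 1024, False,
                           AnyPropertyType, &type, &fmt, &nitems, &bytes_after,
                           &data);
        if (data)
            XFree(data);

        XCloseDisplay(dpy);
        return 0;
    }

A shared memory transport would remove the kernel bcopy from this exchange, but the round trip and its context switches would remain.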